Prediction of site-specific amino acid distributions and limits of divergent evolutionary changes in 

protein sequences 



Markus Porto, 1 H. Eduardo Roman, 2 '0 Michele Vendruscolo, 3 and Ugo Bastolla 4 

' lnstitut fiir Festkorperphysik, Technische Universitdt Darmstadt, Hochschulstr, 8, 64289 Darmstadt, Germany 
2 Dipartimento di Fisica and INFN, Universita di Milano, Via Celoria 16, 20133 Milano, Italy 
3 Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK 
4 Centro de Astrobiologia (INTA-CSIC), 28850 Torrejon de Ardoz, Spain 
(Dated: March 26, 2004, revised October 15, 2004) 

We derive an analytic expression for site-specific stationary distributions of amino acids from the Structurally 
Constrained Neutral (SCN) model of protein evolution with conservation of folding stability. The stationary 
distributions that we obtain have a Boltzmann-like shape, and their effective temperature parameter, measuring 
the limit of divergent evolutionary changes at a given site, can be predicted from a site-specific topological 
property, the principal eigenvector of the contact matrix of the native conformation of the protein. These analytic 
results, obtained without free parameters, are compared with simulations of the SCN model and with the site- 
specific amino acid distributions obtained from the Protein Data Bank. These results also provide new insights 
into how the topology of a protein fold influences its designability, i.e. the number of sequences compatible 
with that fold. The dependence of the effective temperature on the principal eigenvector decreases for longer 
proteins, a possible consequence of the fact that selection for thermodynamic stability becomes weaker in this 
case. 



I. INTRODUCTION 

The reconstruction of phylogenetic distances from se- 
quence alignments requires the use of a model of protein evo- 
lution. Evolutionary models are also needed for reconstruct- 
ing phylogenetic trees in the framework of maximum likeli- 
hood methods (Felsenstein, 1981). In this context, both the 
mutational process and the selection on protein folding and 
function must be taken into account. It is well known that the 
local environment of a protein site within the native structure 
influences the probability of acceptance of a mutation at that 
site (Overington et al., 1990). Nevertheless, structural biol- 
ogy still has a very limited impact in studies of phylogenetic 
reconstruction, and the models commonly used in these stud- 
ies rely on substitution matrices that do not consider the struc- 
tural specificity of different sites. The most used substitution 
matrices, such as JTT (Jones et al., 1992), are obtained ex- 
trapolating substitution patterns observed for closely related 
sequences, and they have low performances when distant ho- 
mologs are concerned (Henikoff and Henikoff, 1993). 

To take into account the protein-level selection, it is neces- 
sary to consider site-specific amino acid distributions within 
a protein family (Halpern and Bruno, 1998). Similarly, the 
use of site-specific substitution matrices improves substan- 
tially maximum likelihood methods for reconstructing phylo- 
genetic trees (Lio and Goldman, 1998; Koshi and Goldstein, 
1998; Koshi et al., 1999; Thorne, 2000; Fornasari et al., 2002). 

In the studies mentioned above, site-specific constraints are 
obtained either through simulations of a model of protein evo- 
lution or fitting parameters in a maximum likelihood frame- 
work. In this work we derive an analytic expression, without 
adjustable parameters, for site-specific amino acid distribu- 
tions and show that they are in very good agreement with sim- 
ulations of a model of protein evolution and with site-specific 
amino acid distributions obtained from the Protein Data Bank. 



The site-specific evolutionary patterns that we predict are 
associated with a key topological indicator of native state 
topology, namely the principal eigenvector of the contact ma- 
trix (Bastolla et al., 2004; Porto et al., 2004). Our approach is 
based on the Structurally Constrained Neutral model (SCN 
model; Bastolla et al., 1999, 2000a, 2002, 2003a, 2003b), 
where possible mutations are tested for conservation of struc- 
tural stability using a computational model of protein folding 
(Bastolla et al., 1998, 2000b, 2001). This approach is simi- 
lar in spirit to the Structurally Constrained Protein Evolution 
model (SCPE; Parisi and Echave, 2001), that uses a different 
criterion to assess the thermodynamic stability. Notice that 
the SCN model does not assume independence at different 
sites, and in fact we have recently shown that site-dependence 
is linked to a rather debated feature of neutral evolution, the 
overdispersion of neutral substitutions (Bastolla et al., 2003b). 

Site-specific amino acid distributions with Boltzmann form, 
similar to those that we derive here, have already been used in 
other studies of protein evolution, for instance by Koshi and 
Goldstein (1998), Koshi et al. (1999) and Dokholyan et al. 
(2001 ; 2002). As we will discuss, the main difference between 
our approach and these previous models consists in the fact 
that our approach allows to compute analytically the Boltz- 
mann parameters, without the need of any empirical sequence 
family. 

As we have previously shown (Bastolla et al., 2003a), 
the site-specific conservation patterns provided by the SCN 
model, and therefore also by the present analytic formu- 
lation, match qualitatively structural conservation patterns 
found in bioinformatics studies (Ptitsyn, 1998; Ptitsyn and 
Ting, 1999). Further, the site-specific amino acid distribution 
that we present here also enables us to derive an estimate of 
the sequence entropy compatible with a given fold, which is 
termed the 'designability' of the fold (Li et al., 1998; Helling 
et al., 2001). These results are in agreement with recent stud- 
ies suggesting that the designability can be inferred from the 
protein topology alone (Koehl and Levitt, 2002; England and 
Shakhnovich, 2003). 
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H. BACKGROUND 

In the SCN model (Bastolla et al., 2002, 2003a), starting 
from the protein sequence in the Protein Data Bank (PDB), 
amino acid mutations are randomly performed and accepted 
according to a stability criterion based on an effective model 
of protein folding (Bastolla et al., 2000b, 2001). We use a 
coarse-grained representation of protein structures as a binary 
contact matrix Cy with elements equal to one if i and j are 
in contact (at least one pair of heavy atoms, one belonging to 
each amino acid, are less than 4.5 A apart), and zero other- 
wise. The effective free energy for a sequence A in configu- 
ration C is approximated by an effective contact free energy 
function E (A, C), 



£(A,C) 
k B T 



■ YtQjUiAuAj), 



(1) 



where U is a 20 x 20 symmetric matrix with U(a,b) represent- 
ing the effective interaction, in units of Icq T, of amino acids 
a and b when they are in contact; we use the interaction ma- 
trix derived by Bastolla et al. (2001). The free energy func- 
tion, Eq. Q, assigns lowest free energy to the experimentally 
known native structure against decoys generated by thread- 
ing the sequence over protein-like structures derived from the 
PDB. If gaps in the sequence-structure alignment are allowed, 
one has to use a suitable gap penalty term. 

For testing the stability of a protein conformation, we use 
two computational parameters: (i) The effective energy per 
residue, E(A, C) /N, where N is the protein length, which cor- 
relates strongly with the folding free energy per residue for a 
set of 18 small proteins that are folding with two-states ther- 
modynamics (correlation coefficient 0.91; U. Bastolla, un- 
published result); (ii) The normalized energy gap a, which 
characterizes fast folding model sequences (Bastolla et al., 
1998) with well correlated energy landscapes (Bringelson and 
Wolynes, 1987; Goldstein et al., 1992; Abkevich et al., 1994; 
Gutin et al., 1995; Klimov and Thirumalai, 1996). A mu- 
tated sequence is considered thermodynamically stable if both 
computational parameters are above predetermined thresholds 
(Bastolla et al., 2003a). 

The interaction matrix U can be written in spectral form as 
U(a,b) = Y^i=\£o.u( a \a)u( a \b), where £ a are the eigenval- 
ues, ranked by their absolute value, and vS a > are the corre- 
sponding eigenvectors. The main contribution to the interac- 
tion energy comes from the principal eigenvector uy\ which 
has a negative eigenvalue £i < 0, as £i u^\a)u^'(b) has cor- 
relation coefficient 0.81 with the elements U(a,b) of the full 
matrix. It is well known that hydrophobic interactions deter- 
mine the most significant contribution to pairwise interactions 
in proteins, so that the components of the main eigenvector 
are strongly correlated with experimental hydropathy scales 
(Casari et al., 1992; Li et al., 1997). Considering only this 
main component, we define an approximate effective energy 
function H(A,C), yielding a good approximation to the full 
contact energy, Eq. Q, 



*j^= ei £q,A(A,)*(A 7 ). 



(2) 
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We call h(A) = u^(A) the Hydrophobicity Profile (HP) of 



sequence A (Bastolla et al., 2004). This is an A^-dimensional 
vector whose z'-th component is given by h(Af) = u^'(Ai), The 
20 parameters h{a) = u^ x \a) are obtained from the principal 
eigenvector of the interaction matrix, and we call them inter- 
activity parameters. 

The HP provides a vectorial representation of protein se- 
quences. In turn, a convenient vectorial representation of pro- 
tein structures may be obtained by the principal eigenvector of 
the contact matrix C, which we denote by PE and we indicate 
by c. This vector maximizes the quadratic form Y,ijQj c i c j 
with the constraint L, c? = 1. In this sense, c, can be inter- 
preted as the effective connectivity of position i, since posi- 
tions with large c ( - are in contact with as many as possible 
positions j with large Cj. All its components have the same 
sign, which we choose by convention to be positive. More- 
over, if the contact matrix represents a single connected graph 
(as it does for single-domain globular proteins), the informa- 
tion contained in the principal eigenvector is sufficient to re- 
construct the whole contact matrix (Porto et al., 2004). 

For any given protein fold, identified through its PE, we 
can define the optimal HP, h pt, as the HP that minimizes 
the approximate effective free energy, Eq. @, for fixed 
mean and squared mean of the hydrophobicity vector, (lij = 
N- l Y,jh(Ai) and (h 2 ) = A^ 1 1,-/z(A,-) 2 . We impose a condi- 
tion on the mean hydrophobicity, (/z), because, if a sequence 
is highly hydrophobic, not only the native structure but also 
alternative compact structures will have favorable hydropho- 
bic energy. Selection to maintain a large normalized energy 
gap is therefore expected to place constraints on the value of 
(h). 

From the above property of the PE, c, it is clear that the 
optimal HP, h opt , is strongly correlated with the PE (Bastolla 
et al., 2004). In this formulation, (h^ and (h 2 ) are not deter- 
mined by the native structure but depend on the mutation and 
selection process (see below). 

The optimal HP constitutes an analytic solution to the se- 
quence design problem with energy function given by Eq. Q, 
and an approximate solution to sequence design with the full 
contact energy function, Eq. Q. In the SCN evolutionary 
model, attempted mutations are accepted whenever the effec- 
tive free energy and the normalized energy gap overcome pre- 
defined thresholds. Therefore, we do not expect that the op- 
timal HP is ever observed during evolution, but we do expect 
thermodynamically stable sequences compatible with a given 
fold to have HPs distributed around the optimal one. We have 
verified this prediction, finding that the HPs of individual se- 
lected sequences are correlated with the principal eigenvector 
of the protein fold with correlation coefficient of 0.45 (av- 
eraged over seven protein folds and over hundred thousands 
of simulated sequences per fold), whereas the HP averaged 
over all sequences compatible with a given fold, [h] ,, corre- 
lates much more strongly with the principal eigenvector of that 
fold, with mean correlation coefficient 0.96 averaged over the 
same seven folds (Bastolla et al., 2004). These results show 
that one can recover the optimal HP through an evolutionary 
average of the HPs compatible with the protein fold. 

We found a similar result for the protein families repre- 
sented in the PFAM (Bateman et al., 2000) and in the FSSP 
(Holm and Sander, 1996) databases. In this case, however, the 
correlation between the principal eigenvector and the evolu- 
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tionary average of the HP is weaker: The average correlation 
coefficient is 0.57 for PFAM families and 0.58 for FSSP fam- 
ilies (Bastolla et al., 2004). This weaker correlation is not 
unexpected, since functional conservation, which is not ac- 
counted for in the SCN model, plays an important role in pro- 
tein evolution, and the effective energy function that we use 
to test thermodynamic stability is only an approximation. In 
addition, the number of sequences used to calculate the av- 
erage HP in the PFAM and FSSP databases is much smaller 
than the number of sequences obtained by the SCN model. 
When the average HP is computed using only some hundreds 
of SCN sequences, the same order of magnitude as in PFAM 
or FSSP families, the correlation with the principal eigenvec- 
tor also drops to values comparable to those observed for the 
PFAM and the FSSP sequence databases. 



III. RESULTS 
A. Derivation of Jt, (a) 

The results presented above suggest that the SCN model 
of protein evolution can be represented as a trajectory in se- 
quence space where the HP moves around the optimal HP, 
which is strongly correlated with the principal eigenvector of 
the protein fold's contact matrix (Bastolla et al., 2004). In this 
work, we use this analogy to compute analytically the site- 
specific distribution of amino acid occurrences 7C/(a), where 
i indicates a protein position and a indicates one of the 20 
amino acid types. 

Our previous results indicate that the evolutionary average 
of the hydrophobicity vector, [h] j, is correlated with the 
principal eigenvector c of the native contact matrix. In order to 
derive an analytical expression for 7C/(a), we now assume that 
the correlation coefficient between the principal eigenvector 
and the average HP is 1, i.e. r([h] eyol ,c) = « [h]^, • c) - 

( Wevol)( C ))/(^ a Hevo 1 ) = l > aIld We ° btain 



[ /l ']evol^E 7t '(«)K«)=A( C ,7( C )-l) 

{a} 



(3) 



where the sum over {a} is over all amino acids, and 



( P. evol) ( evol) 



\ ((c 2 )-(c) 2 )/(cf 



andB = (Wevol)- (4) 



We are representing here two kinds of average: The angular 
brackets denote the average over the N positions of the pro- 
tein, if) = N^ 1 £///, an d the corresponding standard devia- 
tion is denoted by ai — (/ 2 ) — (/) 2 , whereas square brack- 
ets denote position-specific evolutionary averages, [/] evo] = 
L{a} Ki{a)f{a). 

Eqs. J3l4i are the conditions that the stationary distributions 
Kj(a) have to fulfill in order to guarantee a perfect correla- 
tion between principal eigenvector and the average HP. We 
assume that these conditions are the only requirement that the 
7E/(a) have to meet, so that at stationarity the 71/ (a) are the 
distributions of maximal entropy compatible with the above 
conditions. It is well known that the solution of this optimiza- 
tion problem are Boltzmann-like (exponential) distributions, 



characterized by an effective 'temperature' |p/| _1 that, in this 
context, varies from site to site and measures the tolerance of 
site i to accept mutations over very long evolutionary times, 



71/ (a) 



exp[-|3;/z(a) 



I{«'} ex P[-P*'%')] ' 
with the constraint, Eq. Q, 



(5) 



£exp[-p\-%)] [Ka) -A (c,/(c) - l) -B] = 0. (6) 

M 

Eq. states an analytical relation between the 'Boltz- 
mann parameter' p; and the principal eigenvector component 
c,, given the two evolutionary parameters A and B. One 
sees from this equation that p; equals zero if c,/ (c) = 1 + 
A~ l (Y.{a}h{ a )/20~ (Jhj j)), and p, becomes negative for 
larger c, and positive for smaller c,-. We expect that the re- 
lationship between p, and c; is almost linear in the range 
c// (c) — 1 1 -C 1, whereas p, tends to minus infinity when the 
average hydrophobicity at site i, [A;] v tends to the max- 
imally allowed value, and to plus infinity when the average 
hydrophobicity at site i tends to the minimum allowed value. 

Eq. can be interpreted as follows: (i) Positions with 
large eigenvector component c/ are buried in the core of the 
protein and are therefore with high probability occupied by 
hydrophobic amino-acids (positive h(a)) and have large and 
negative p,, (ii) surface positions with small c, are more likely 
occupied by polar amino acids (negative h(a)) and have large 
and positive p,, and (iii) intermediate positions are the most 
tolerant having a small negative or small positive p,-. 

Similar forms of site-specific amino acid distributions 
have been previously proposed by other authors, see for in- 
stance Koshi and Goldstein (1998), Koshi et al. (1999) and 
Dokholyan et al. (2001; 2002). We will discuss analogies 
and differences between these previous approaches and the 
present one in the last section. 



B. Validation: SCN model 

We first verified these predictions using the simulated tra- 
jectories obtained through the SCN model. We found that 
the site-specific stationary distributions of amino acids, 7C/(a), 
are well fitted by an exponential function of hydrophobicity, 
7C/(a) °z exp[— p,/j(a)], where we use the interactivity scale 
given by the main eigenvector of our interaction matrix. 

Analytically, the expected site-dependent Boltzmann pa- 
rameter p ( - can be calculated in an implicit form by rewriting 
Eq. l|6} as 



;,/<e) = l+A- 



£ M /i(a)exp[-p//i(a)] 
E{«}exp[-P//z(a)] 



B 



, (7) 



giving Ci as a function of p,. To obtain p, as a function of c/, 
Eq. Q has to be inverted numerically, and the parameters A 
and B are obtained from the simulation data by Eq. @. 

The predictions derived in this way, for the ribonuclease 
fold (PDB id. 7rsa), are compared in Fig. ^ to the p, ob- 
tained by fitting the distributions 7C;(a) simulated through the 
SCN model. The prediction does not involve any adjustable 
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FIG. 1: 'Boltzmann parameter' P,- as a function of the scaled princi- 
pal eigenvector component c,/(c) as obtained by the SCN model 
for ribonuclease (PDB id. 7rsa). The line shows the analyti- 
cal prediction, Eq. jTJ, obtained using the mean hydrophobicity 

( M evol) = ?" 108 a " d the ™ Ce ( M Ll) - ( M eve!)' = °- 0077 

as observed in the simulations of the SCN model. The dashed part of 
the curve indicates the forbidden area c,- < 0. The inset examplifies 
the numerical obtained — ln[ji/(a)] vs hydrophobicity h(a) of amino 
acid a, as obtained for ribonuclease at protein position i = 50 (with 
C5o/(c) = 0. 124), yielding p 50 = 4.92. 



Protein 


PDB id. 


N 


P 


R 


r 


rubredoxin (mesophilic) 


liro 


53 


0.98 


0.96 


0.96 


rubredoxin (thermophilic) 


lbrf_A 


53 


0.98 


0.96 


0.97 


SH3 domain 


laey 


58 


0.93 


0.97 


0.95 


cytochrome c 


451c 


82 


0.94 


0.84 


0.84 


ribonuclease 


7rsa 


124 


0.94 


0.83 


0.89 


lysozyme 


31zt 


129 


0.93 


0.73 


0.82 


myoglobin 


la6g 


151 


0.95 


0.94 


0.88 


ubiquitin conjugating enzyme 


lu9a_A 


160 


0.90 


0.88 


0.79 


TIM barrel 


7tim_A 


247 


0.94 


0.88 


0.84 


kinesin 


lbg2 


323 


0.87 


0.79 


0.73 



TABLE I: Correlation coefficients between site-specific quantities 
predicted using the equations derived in this work and observed in 
SCN simulations. Fourth column: 'Boltzmann parameters' p. Fifth 
column: Rigidities R. Sixth column: Substitution rates r. Note that 
sites containing cysteine residues are not included in the calculation 
of the correlation coefficients, as cysteine residues form pairwise 
disulphide bridges (which are very poorly represented through hy- 
drophobicity) and are strictly conserved in our evolutionary model. 



parameter, since A and B are calculated from ([^] evol ) and 

( M evol) as determined by the results of the SCN model. The 
latter values do not depend on the native structure, but they 
depend generally on the mutation and selection process, and 
specifically on its realization as simulated through the SCN 
model. The other proteins studied show a similar behavior, 
and the mean correlation coefficient between predicted and 
observed p, is 0.935 (cf. Table Q. 

One notices from Fig. ^ that P; depends almost linearly 
on the scaled principal eigenvector component c,/ (c\ over a 
wide range of values. The slope of this linear part, the deriva- 
tive — 8p/3c taken at c = c/(c) = 1, scales with length N, 
and it can be well fitted by the expression Ao +A\/N with 
A =2.50 ±0.21 andAi = 104.8 ± 9.2 for the proteins listed 
in Table 12 see Fig.|2ja). One sees from this expression that, 
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FIG. 2: (a) Plot of slope — 3p/9c, with c = c/lc) obtained at c = 1, vs 
inverse chain length 1 /N. The line shows a fit of the form Ao + A\ /N. 
(b) Plot of the standard deviation of the hydrophobicity, Owi^, vs 
inverse chain length l/N as obtained by the SCN model. The line 
shows a fit of the form Bq + B\ /N. 



as the protein length increases, the Boltzmann parameter p, 
becomes less dependent on c,/ (c) and therefore more homo- 
geneous. This result, at first sight unexpected, can be better 
understood considering Eq. 0, which implies that the deriva- 
tive of p,- with respect to c = Ci/(c) is proportional to the stan- 
dard deviation of the hydrophobicity, 



dc 



p=o 



A 



( \h\ evol) ( A. evol) 



««*) -(cf) /(cf 



(8) 



The standard deviation of the hydrophobicity, measured from 
simulations of the SCN model, decreases as a function of 
chain length as Bq + B x /N with B = 0.068 ± 0.016 and Bi = 
2.55 ±0.17, see Fig.|3b). Thus, site-specificity decreases for 
longer proteins because the evolutionary average of the HP 
becomes more homogeneous across different positions. This 
result can be explained by noticing that not only the stan- 
dard deviation, but also the mean hydrophobicity, ([A] j), 
decreases as a function of chain length (notice again the two 
kinds of average). So, the result that site-dependent Boltz- 
mann parameters become more homogeneous for longer pro- 
teins ultimately depends on the fact that longer proteins tend 
to be less hydrophobic. 



C. Validation: PDB structures 

We compared our predictions to site-specific distributions 
obtained from a representative subset of the Protein Data 
Bank (PDB). We considered a non-redundant subset of single- 
domain globular proteins in the PDB, with a sequence iden- 
tity below 25% (Hobohm and Sanders, 1994). Globularity 
was verified by imposing that the fraction of contacts per 
residue was larger than a length dependent threshold, N c /N > 
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3.5 +7.8A^ _1//3 . This functional form represents the scaling 
of the number of contacts in globular proteins as a function 
of chain length (the factor N~ 1 / 3 comes from the surface to 
volume ratio), and the two parameters were chosen so as to 
eliminate outliers with respect to the general trend, which 
are mainly non-globular structures. The condition of being 
single-domain was verified by imposing that the normalized 
variance of the PE components was smaller than a threshold, 

(l -N(c) 2 )/(n(c) 2 ) < 1.5. Multi-domain proteins havePE 
components which are large inside their main domains and 
small outside them. The PE components would be exactly 
zero outside the main domains if the domains do not share 
contacts. Therefore, multi-domain proteins are characterized 
by a larger normalized variance of PE components with re- 
spect to single-domain ones. We have verified that the thresh- 
old of 1 .5 is able to eliminate most of the known multi-domain 
proteins and very few of the known single-domain proteins 
(data not shown). 

We thus selected 774 sequences of various lengths, 404 of 
which were shorter than 200 amino acids. We first considered 
only this subset of short proteins, and then the whole data set, 
divided in bins of similar lengths. We counted the number of 
each of the 20 amino acids as a function of c,-/(c), where (c) 
denotes the average over a single structure. We used a bin-size 
of 0.05 for c;/ (c) < 2.5 and a bin-size of 0. 1 for c,/ (c) > 2.5. 
Then, for each bin of c,/<^c), we fitted the observed distri- 
butions Jt c .^ c ^(a) with an an exponential function of the hy- 
drophobicity parameters, n c .i/ c \(a) exp[— p,.../^ h(aj\. As 
in the case of the SCN simulations, we used the interactivity 
scale derived from our effective free energy function. The ex- 
ponential fit was sufficiently good, and yielded the observed 
Boltzmann parameters as a function of the normalized PE 
components. 

Next we calculated the predicted Boltzmann parameters 
through the equation: 



i/(c) = 1 +A- 



E{4%)exp[-|3 c ./ <c )%)] 



E{ a }exp[-P Ci / (c )%)] 
with the parameters A and B now given by 



B 



(9) 



A = 



(MpDb)c;/(c) (WpD B ) c; / (c) 



md$=([h] mB ) ct/{c) , 



I PDB 



(10) 

where the square brackets [/i] PDB now denote, instead of 
the evolutionary average over a protein family, the average 
over all positions with fixed c,/ (c), even belonging to differ- 
ent structures, whereas angular brackets, ( Mpdb)c-/(c)' c ' e " 
note the average over all values of Cf/(c). The denomina- 
tor [((c 2 ) — ( C ) 2 )/( C ) 2 ] PDB indicates the quantity ((c 2 ) — 

(c) 2 ) I (c) 2 , obtained individually for each structure, averaged 
over the whole set of structures. 

The observed Boltzmann parameters are compared in Fig. [3] 
to the predictions of Eq. (|9j- Notice that the agreement is 
indeed remarkable, as the predictions do not involve any ad- 
justable parameter, since A and B are calculated from the PDB 
data. 

We stated in the previous section that, with SCN data, the 
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FIG. 3: 'Boltzmann parameter' P c ./( c ) as a function of the scaled 
principal eigenvector component c;/(c) as obtained by analyzing 
a subset of 404 non-redundant single-domain globular structures of 
the PDB. The line shows the analytical prediction, Eq. |9j, obtained 
using the mean hydrophobicity ([^] PDB ) =0.128 and the variance 

( M pdb ) ~~ ( M PDB ) 2 = 0-009 as obtained from this set. The dashed 
part of the curve indicates the forbidden area c,- < 0. The inset 
examplifies the numerically obtained — ln[7t(a)] vs hydrophobicity 
h(a) of amino acid a, as obtained for c,/(c) £ [2.45,2.5], yielding 
B = -4.53. 
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FIG. 4: Plot of slope — 3(3/3c, with c = c/lc) obtained at c = 1, vs 
inverse chain length 1 /N, obtained for structures of different length 
in the PDB. 



dependence of the Boltzmann parameters on the PE com- 
ponents becomes weaker, and the amino acid distributions 
become more homogeneous, for longer chains. We tested 
whether this interesting observation also holds for site-specific 
distributions obtained from the PDB. To this end, we divided 
our set of proteins in five bins of proteins with less than 100, 
200, 300, 400, and 500 amino acids, respectively, and we mea- 
sured the dependence of the Boltzmann parameters on the PE 
component for each bin. From these data, we obtained the 
derivative of the Boltzmann parameters at c = c,/ (c) = 1 sim- 
ply by fitting a straight line to p (c) for c in the range between 
0.5 and 1.5. For all lengths considered, the fit was reasonably 
good, with correlation coefficients always larger than 0.95. 
The obtained slopes are plotted against 1 /N in Fig. |4] Al- 
though the errors are rather large, the data is compatible with 
the functional form — 3p/3c 1=3 A' +A\/N. The parameters 
that we obtained by fitting this relationship are also compati- 
ble, within error bars, with those obtained from SCN data: We 
find, for PDB structures, A' = 2.33 ±0.17 and A\ = 117 ±25. 
While a more careful study is still necessary on this issue, the 
present results seem to confirm this length effect on the amino 
acid distributions. 
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D. Conservation and designability 

The results that we present imply that there is a direct rela- 
tionship between a structural indicator, the principal eigenvec- 
tor of the contact matrix, and site-specific measures of long- 
term evolutionary conservation and hence of limits in diver- 
gent evolutionary changes. This relationship also provides a 
link between the topology of a fold and its designability. 

One convenient measure of the amino acid conservation at 
a given position is given by the rigidity, defined as 



** = I>(a) 

{a} 



lM ex P[- 2 M(«)] 
{L{ fl }exp[-p\/z(a)]} 2 



(11) 



Rj = 1 means that the same amino acid is present at position 
i in all sequences, i.e. the conservation is total and p^ 1 = 0. 
In general, the rigidity decreases with increasing temperature 
|Pi| . One can use Eqs. ( 1711 11 to compute the rigidity from 
the principal eigenvector. However, the dependence becomes 
clearer when fitting directly the rigidity as a function of the 
principal eigenvector component. We did this on the data 
generated through the SCN model, finding that the best fit 
is parabolic and the three parameters depend on the protein 
length. Instead of the PE component c,-, we use the rescaled 
variable x, = \/Nci, which is normalized such that its mean 
squared value is one, i.e. (x 2 ) = Af~ 1 £,x? = 1. Using this 
scaled variable, the rigidities of all proteins in our data set can 
be fitted by the expression 
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with A = 0.0630 ±0.0017, B = 0.28 ±0.16, C = 1.879 ± 
0.102, D = -5.36±0.89, and d = 0.47±0.11. The factor 
N~ d w N~ l/!2 can be interpreted as a further indication that 
rigidities tend to become more homogeneous for larger pro- 
teins, as previously noted concerning the Boltzmann param- 
eters. Eq. fl!2i predicts the rigidity for all the proteins that 
we studied with the SCN model, with an average root mean 
square relative error of around 10% and an average correla- 
tion coefficient between predicted and observed rigidities of 
0.88 (cf. Table [Q. These values change very little in leave- 
one-out tests, where we discard one protein for estimating the 
parameters and then use it as a blind test. 

A standard information-theoretic measure of site-specific 
sequence conservation is given by the entropy of the amino 
acid distribution, 



- £ Ki(a) log [%i(a)} = log [Z(p,-)] + Pi [hi 

{"} 



evol J 



(13) 



where Z(p,) = L„exp(— p,/z(a)). The entropy attains its max- 
imum value, Si = log(20), at P, = and it decreases with in- 
creasing |P,|. Predictions of the entropy based on a different 
approach, Eq. dl4> . using aligned protein families have been 
obtained by Dokholyan et al. (2001; 2002). 

An important property of the entropy is that its exponen- 
tial, exp(S,-), provides an estimate of the average number of 
amino acid types acceptable at position i over very long evo- 
lutionary times. The exponential of the sum of all site-specific 
entropies, exp(£,-S,-), gives an estimate of the sequence space 
compatible with a given fold, termed the designability of the 



fold, where we are assuming that the amino acid distributions 
at different positions are independent. Although this assump- 
tion is clearly oversimplified, the estimate of designability that 
can be obtained should be a valuable approximation, and our 
approach allows to connect it explicitly to a topological fea- 
ture of the protein native structure (Koehl and Levitt, 2002; 
England and Shakhnovich, 2003). 



IV. DISCUSSION AND CONCLUSIONS 

We have predicted analytically that amino acids at specific 
positions are distributed according to a Boltzmann law. In 
the latter, the role of energy is played by the hydrophobicity, 
measured through an interactivity scale, and that of temper- 
ature by a quantity strongly correlated with a structural indi- 
cator, namely the corresponding component of the principal 
eigenvector of the native contact matrix (PE). This prediction, 
which does not involve any free parameter, is in very good 
agreement both with simulations of the SCN model of protein 
evolution and with site-specific amino acid distributions that 
we obtained from a representative subset of single-domain 
globular proteins in the PDB. 

The relationship between a structural profile (the PE) and an 
evolutionary profile (the Boltzmann parameters) that we find 
has an interesting interpretation. Positions with a large princi- 
pal eigenvector component are strongly interacting with other 
positions in the core of the protein. Therefore, they preferen- 
tially host strongly hydrophobic amino acids (large negative 
P,). On the other hand, positions with small PE are contained 
in loops and, with higher probability, they host polar amino 
acids (large positive p,). 

Despite this general interpretation, we have verified, us- 
ing the SCN model of protein evolution, that the PE has the 
strongest predictive power among several similar structural in- 
dicators that quantify the difference between core and surface 
positions of a protein. As alternative structural indicators, we 
considered the total number of contacts, the number of long- 
range contacts (separated by more than 10 residues along the 
sequence) and the contact order (the average loop length of 
each contact involving a given site), and we correlated them 
with measures of site-specific conservation, but we always 
found a correlation significantly weaker than for the PE. 

The distributions derived here refer to very long evolution- 
ary times, when memory of the starting sequence has been 
lost. We recall the assumptions that we made for deriving the 
site-specific distributions. The first assumption is that selec- 
tion on folding stability can be represented effectively as a 
maximal correlation between the hydophobicity profile (HP) 
of sequences compatible with a given fold and the optimal HP 
of that fold, the latter basically coinciding with the PE. This 
assumption follows directly from an approximation of our ef- 
fective free energy function with its principal (hydrophobic) 
component. The second assumption is that the average of the 
HP of selected sequences over very long evolutionary times 
has a correlation coefficient of unity with the PE, i.e. all other 
energetic contributions average out. The third assumption is 
that this correlation is the only relevant property of the site- 
specific amino acid distributions, in other words, these distri- 
butions are the distributions of maximal entropy whose site- 
specific averages have correlation of one with the PE, thus 
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fulfilling the stability requirement. From these three assump- 
tions the Boltzmann form of the amino acid distributions is 
straightforward. In order to compute the site-specific Boltz- 
mann parameters, however, we still have to know the posi- 
tional mean and standard deviation of the site-specific HPs. 
These quantites are determined by the mutation process and 
by the selection parameters. They can be computed directly 
from the data, in such a way that the analytic prediction does 
not contain any free parameter. 

Boltzmann distributions, as those proposed in this work, 
have a long history in studies of protein structure and evo- 
lution. Structural properties of native protein structures, as 
for instance amino acid contacts, have been assumed to be 
Boltzmann-distributed (Miyazawa and Jernigan, 1985), and 
Boltzmann statistics for structural elements was predicted in 
stable folds of globular proteins (Finkelstein et al., 1995). Our 
work may point out to an alternative or complementary expla- 
nation for such distributions. 

Shakhnovich and Gutin (1993) proposed a model of se- 
quence design through Monte Carlo optimization, which pro- 
duced a Boltzmann distribution in sequence space. A mean 
field approximation of this model (Dokholyan et al. 2001; 
2002) results in site-specific amino acid distributions of the 
form 

JitfflJocexpt-PMa)], (14) 

formally similar to Eq. Q. There are, however, three impor- 
tant differences between our formulation and Eq. (I14> . First, 
Eq. (II 41 was derived as an independent site approximation 
to a Boltzmann distribution for entire sequences, whereas we 
derived Eq. Q from the relationship between average hy- 
drophobicity at a given site and the PE component. Second, in 
Eq. (I14> . the Boltzmann parameter p is the same for all sites, 
whereas we obtain the p, profile along the protein structure 
from the PE. Third and most important, in order to compute 
Eq. {ID, Dokholyan et al. (2001; 2002) used aligned fami- 
lies of natural proteins, whereas our computation only needs 
the PE and two empirical values, the average and the standard 
deviation along the hydrophobicity profile. 

Koshi and Goldstein (1998) and Koshi et al. (1999) as- 
sumed site-dependent Boltzmann distributions of physico- 
chemical amino acid properties, deriving from it a protein evo- 
lution model that they use for phylogenetic reconstruction in 
a maximal likelihood framework. Since the properties they 
used are hydrophobicity and amino acid size, their proposed 
distributions are formally a general case of those derived in 
this work. There are, however, two important differences be- 
tween their approach and ours: First, in our approach sites are 
structurally classified according to the PE component, which 
is a structural indicator very strongly correlated with conser- 
vation; Second, whereas the Boltzmann parameters are treated 
as free parameters by Koshi et al., and fitted in a maximal like- 
lihood framework, in our approach they are computed analyt- 
ically. 

Kinjo and Nishikawa have very recently pointed out the ex- 
istence of a strong relationship between hydrophobicity and 
the main eigenvector of substitution matrices derived from 
protein alignments with various values of the sequence iden- 
tity of the aligned proteins (Kinjo and Nishikawa, 2004). They 
looked at the eigenvector corresponding to the largest eigen- 
value (in absolute value) of the substitution matrices. For high 



sequence identities (above 35 percent), this eigenvector indi- 
cates the propensity of the amino acid to mutate over short 
evolutionary times (mutability). For low sequence identities 
(below 35 percent), corresponding to long evolutionary times, 
this eigenvector is very strongly correlated with hydrophobic- 
ity. This correlation is easily understood in light of the re- 
sult presented here. In fact, Kinjo and Nishikawa used the 
Henikoff's method (Henikoff and Henikoff, 1992) for deriv- 
ing substitution matrices from observed frequencies of aligned 
amino acids at positions with various PE values. Using 
our notations, these substitution matrices can be indicated as 
M(a,b) w log [(7i;(a)7t,'(fe))/(7l ! -(a))(jt ; '(^))], where the an- 
gular brackets denote positional average. In other words, these 
substitution matrices measure the tendency of two residue 
types a and b to co-occur at the same sites. The relationship 
between large time substitution matrices and hydrophobicity 
gives therefore independent support to our main result. 

In dealing with protein families generated from natural evo- 
lution, our approach has two main difficulties: First, it can 
not take into account functional conservation; second, a large 
number of sequence pairs separated by very long evolutionary 
time is required. Because of these difficulties, the site-specific 
conservation predicted by our model agrees only qualitatively 
with site-specific conservation observed in aligned protein 
families (Bastolla et al., 2003a). For tackling these difficulties, 
we tested our analytic predictions on a large set of proteins in 
the Protein Data Bank. Positions belonging to different pro- 
tein folds were counted together in the same structural class 
characterized by the value of the normalized PE. 

The agreement between the analytic prediction and the ob- 
served distributions may be surprising in view of the fact 
that Eq. l|9} takes into account neither the genetic code nor 
the codon usage. It is known that the observed frequencies 
of amino acids correlate strongly with their number of syn- 
onymous codons, and even more strongly with their cumu- 
lative codon frequencies expected in the simple hypothesis 
that the three bases constituting each codon are independently 
distributed (Sueoka, 1961). Moreover, the frequency of nu- 
cleotides at coding positions correlate strongly with the fre- 
quency of nucleotides at non-coding positions such as the syn- 
onymous 3rd codon positions (Bernardi and Bernardi, 1986), 
which are thought to reflect the mutational process. On the 
other hand, mutations are neither adequately represented in 
the SCN model, which is a model at the amino acid level 
where all amino acid mutations are equiprobable. 

The most straightforward way to include the biases of 
the mutational process is to modify Eq. (|5|l with 7t,(a) 
w(a)exp[— P; h(a)], where the weights w(a) are either the 
number of synonymous codons or their cumulative frequency, 
estimated using average nucleotide compositions. We will ad- 
dress this issue in future work. 

We also observed a length effect in the dependence of the 
Boltzmann parameters p, on the PE. For longer proteins this 
dependence becomes weaker, approximately following a law 
of the type Aq + A\/N. This result holds for SCN simulations 
and, although less significantly, for frequency distributions 
obtained from the PDB. Formally, the result directly follows 
from the fact that the evolutionary average of the hydropho- 
bicity, [Aj] p varies less and less across different positions 
i for longer proteins. This length dependence of the HP is 
consistent with the fact that the position average of [Aj] j 



8 



decreases with chain length, i.e. the length of a protein influ- 
ences its composition (White, 1992; Bastolla and Demetrius, 
submitted), such that longer proteins tend to become less hy- 
drophobic. (We note, however, that these results were ob- 
tained using the interactivity scale, and they can not be sig- 
nificantly generalized to other hydropathy scales.) These re- 
sults can be understood considering that longer proteins are 
stabilized by a larger number of interactions per residue. Se- 
lection for protein stability is therefore expected to be weaker 
on individual interactions. In support of this prediction, the 
average interaction energy per contact in PDB structures was 
found to decrease with chain length (Bastolla and Demetrius, 
submitted). 

The site-specific amino acid frequencies that we derived 
may find applications in the estimation of site-specific sub- 
stitution matrices (Li 6 and Goldman, 1998; Thorne, 2000; 
Fornasari et al., 2002), and site-specific structural conserva- 
tion. Our approach may provide analytic understanding of 
the limits imposed by structural and functional contraints to 
the divergent evolution of protein sequences and of the relia- 
bility of phylogenetic reconstructions from protein sequences 
for anciently diverged taxa (Meyer et al., 1986). A better un- 
derstanding of structural conservation may also improve the 



prediction of functional conservation at sites that appear more 
conserved than expected on a structural basis alone (Casari 
et al., 1995; Ota et al., 2003). The dependence of amino acid 
distributions on the PE that we presented here is also a step to- 
wards determining the structural determinants of protein 'des- 
ignability' (Li et al., 1998; Helling et al., 2001; England and 
Shakhnovich, 2003), namely the volume of sequence space 
compatible with a given fold. 
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